Binary Neural Networks Algorithms, Architectures, and Applications (Baochang Zhang, Sheng Xu, Mingbao Lin etc.)

IDa-Det: An Information Discrepancy-Aware Distillation for 1-bit Detectors


[Figure 6.16 panel labels: proposal pairs $(R_1^t, R_1^s)$, $(R_2^t, R_2^s)$, $(R_3^t, R_3^s)$, $(R_4^t, R_4^s)$. Legend: $R_n^s$ in student; paired $R_n^s$ in student; $R_n^t$ in teacher; paired $R_n^t$ in teacher.]

FIGURE 6.16

Illustration of the generation of the proposal pairs. Each proposal in one model generates a counterpart feature map patch at the same location in the other model.

channel-wise proposal feature and measure the discrepancy as

$$
\varepsilon_n = \sum_{c=1}^{C} \left\| \left(R^t_{n;c} - R^s_{n;c}\right)^{\top} \Sigma^{-1}_{n;c} \left(R^t_{n;c} - R^s_{n;c}\right) \right\|_2, \tag{6.83}
$$

where $\Sigma_{n;c}$ denotes the covariance matrix of the teacher and the student in the $c$-th channel of the $n$-th proposal pair. The Mahalanobis distance takes into account both the pixel-level distance between the proposals and the differences in statistical characteristics within each pair of proposals.
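The channel-wise Mahalanobis discrepancy can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `proposal_discrepancy`, the way the covariance is estimated (treating the columns of the teacher and student patches as observations), the ridge term `ridge` added for invertibility, and the use of the Frobenius norm for the matrix norm are all assumptions made here.

```python
import numpy as np

def proposal_discrepancy(r_t, r_s, ridge=1e-3):
    """Sum over channels of ||(Rt - Rs)^T Sigma^{-1} (Rt - Rs)||.

    r_t, r_s: teacher / student proposal features, shape (C, H, W).
    ridge:    hypothetical regularizer keeping the covariance invertible.
    """
    assert r_t.shape == r_s.shape
    num_channels, height, _ = r_t.shape
    total = 0.0
    for c in range(num_channels):
        diff = r_t[c] - r_s[c]                               # (H, W)
        # Assumption: columns of both patches act as observations
        # of an H-dimensional variable, giving an (H, H) covariance.
        samples = np.concatenate([r_t[c], r_s[c]], axis=1)   # (H, 2W)
        cov = np.atleast_2d(np.cov(samples)) + ridge * np.eye(height)
        # diff^T Sigma^{-1} diff, computed without an explicit inverse.
        quad = diff.T @ np.linalg.solve(cov, diff)           # (W, W)
        total += np.linalg.norm(quad)                        # Frobenius norm
    return float(total)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(3, 7, 7))
student = teacher + rng.normal(size=(3, 7, 7))
print(proposal_discrepancy(teacher, student))
```

Identical teacher and student patches give a discrepancy of zero, and the value grows as the student features drift away from the teacher's relative to their joint covariance.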

To select representative proposals with maximum information discrepancy, we first define a binary distillation mask $m_n$ as

$$
m_n =
\begin{cases}
1, & \text{if pair } (R^t_n, R^s_n) \text{ is selected} \\
0, & \text{otherwise}
\end{cases}
\tag{6.84}
$$

where $m_n = 1$ denotes that distillation is applied to this proposal pair; otherwise, the pair remains unchanged. Distillation is needed only for pairs whose teacher and student distributions differ substantially, since only then can the student model learn from its teacher counterpart.
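As a rough sketch of how such a mask could be built, the snippet below marks the pairs with the largest discrepancy scores. The function name `distillation_mask` and the parameter `ratio` (the fraction of pairs to keep) are hypothetical names introduced for illustration, and selecting by simple top-k ranking is an assumption about how "maximum information discrepancy" is applied.

```python
import numpy as np

def distillation_mask(discrepancies, ratio):
    """Return a binary mask with 1 for the most discrepant pairs.

    discrepancies: one score per proposal pair.
    ratio:         hypothetical fraction of pairs to select, in [0, 1].
    """
    scores = np.asarray(discrepancies, dtype=float)
    k = int(round(ratio * scores.size))          # number of selected pairs
    mask = np.zeros(scores.size, dtype=int)
    if k > 0:
        # Indices of the k largest scores (stable sort for ties).
        top = np.argsort(-scores, kind="stable")[:k]
        mask[top] = 1
    return mask

print(distillation_mask([0.2, 1.5, 0.7, 3.0], ratio=0.5))  # [0 1 0 1]
```

Because the mask is binary and the objective is a sum of per-pair scores, ranking the scores and keeping the top entries is enough; no iterative solver is required in this simplified setting.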

On the basis of the derivation above, discrepant proposal pairs will be optimized through distillation. To distill the selected pairs, we resort to maximizing the conditional probability $p(R^s_n \mid R^t_n)$. That is, after distillation or optimization, the feature distributions of the teacher proposals and the student counterparts become similar. To this end, we define $p(R^s_n \mid R^t_n)$ with $m_n$, $n \in \{1, \cdots, N_T + N_S\}$, in consideration as

$$
p(R^s_n \mid R^t_n; m_n) \sim m_n\, \mathcal{N}\!\left(\mu^t_n, {\sigma^t_n}^2\right) + (1 - m_n)\, \mathcal{N}\!\left(\mu^s_n, {\sigma^s_n}^2\right). \tag{6.85}
$$
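To make the role of the mask concrete, the following sketch evaluates the average Gaussian log-density of a student proposal under the statistics chosen by $m_n$. It is only an illustration under assumptions: the function names, the use of a single scalar mean and variance per proposal, and the small constant `eps` are introduced here and are not taken from the text.

```python
import numpy as np

def gaussian_log_density(x, mu, var):
    """Element-wise log-density of N(mu, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

def conditional_log_likelihood(r_s, r_t, m, eps=1e-6):
    """Mean log p(R^s | R^t; m) for a binary mask value m.

    m == 1: student features are scored under the teacher's statistics.
    m == 0: student features are scored under their own statistics.
    eps is a hypothetical constant guarding against zero variance.
    """
    source = r_t if m == 1 else r_s
    mu, var = source.mean(), source.var() + eps
    return float(gaussian_log_density(r_s, mu, var).mean())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 7, 7))
student = teacher + 5.0   # a strongly shifted student proposal
print(conditional_log_likelihood(student, teacher, m=1))
print(conditional_log_likelihood(student, teacher, m=0))
```

For a selected pair ($m_n = 1$), a student whose features are far from the teacher's statistics receives a low log-likelihood, so maximizing it pulls the student distribution toward the teacher's; for an unselected pair the teacher plays no role.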

Subsequently, we introduce a bilevel optimization formulation to solve the distillation problem as

$$
\begin{aligned}
&\max_{R^s_n}\; p(R^s_n \mid R^t_n; m^*), \quad \forall n \in \{1, \cdots, N_T + N_S\}, \\
&\text{s.t.}\quad m^* = \arg\max_{m} \sum_{n=1}^{N_T + N_S} m_n \varepsilon_n,
\end{aligned}
\tag{6.86}
$$

where $m = [m_1, \cdots, m_{N_T + N_S}]$ and $\|m\|_0 = \gamma \cdot (N_T + N_S)$. $\gamma$ is a hyperparameter. In this way, we select $\gamma \cdot (N_T + N_S)$ pairs of proposals that contain the most representative